STAR (Dobin et al., 2013) is an RNA-seq aligner that maps raw RNA-seq reads to a reference genome. STAR accepts single-end or paired-end read (sequence) files as input, ideally in FASTQ format, although FASTA format is also accepted.
| Single-end FASTQ Examples | | Paired-end FASTQ Examples | |
|---|---|---|---|
| sampleA.fastq.gz | sampleA.fq.gz | sampleA_r1.fastq.gz | sampleA_r1.fq.gz |
| sampleB.fastq.gz | sampleB.fq.gz | sampleA_r2.fastq.gz | sampleA_r2.fq.gz |
| sampleC.fastq.gz | sampleC.fq.gz | sampleB_r1.fastq.gz | sampleB_r1.fq.gz |
| | | sampleB_r2.fastq.gz | sampleB_r2.fq.gz |
| | | sampleC_r1.fastq.gz | sampleC_r1.fq.gz |
| | | sampleC_r2.fastq.gz | sampleC_r2.fq.gz |
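For paired-end data, every sample needs both mate files, named with a consistent suffix as in the table above. The following is a minimal sketch of a check for missing mates; the function name `missing_mates` and its argument order are hypothetical and not part of the protocol's scripts.

```shell
# Hypothetical helper: report paired-end samples that are missing a mate file.
# Usage: missing_mates <folder> <end1 suffix> <end2 suffix> <format> <sample>...
missing_mates() {
    local dir=$1 end1=$2 end2=$3 fmt=$4
    shift 4
    local sample end status=0
    for sample in "$@"; do
        for end in "$end1" "$end2"; do
            # File names follow the pattern <sample><suffix>.<format>.gz
            if [ ! -f "$dir/$sample$end.$fmt.gz" ]; then
                echo "missing: $dir/$sample$end.$fmt.gz"
                status=1
            fi
        done
    done
    return $status
}

# Example call (the path is a placeholder):
# missing_mates /path/to/raw/seq/reads/folder _r1 _r2 fastq sampleA sampleB sampleC
```

The function prints one line per missing file and returns a non-zero status if anything is missing, so it can be used to stop a job before mapping starts.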
Publicly available RNA-seq data in these formats can be found in the NCBI Sequence Read Archive (SRA), the NCBI Gene Expression Omnibus (GEO), and other databases. Run accessions from the NCBI SRA conventionally carry the prefix “SRR”. Download the data to any desired folder on SickKids’ HPC cluster using wget, curl, or another non-interactive download tool that supports the HTTP, HTTPS, and FTP protocols.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/005/SRR7506545/SRR7506545_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR750/005/SRR7506545/SRR7506545_1.fastq.gz -o SRR7506545_day00_RNA_sequencing_Rep1_hESC_1.fastq.gz
To conserve disk space, leave the raw sequencing data compressed (.gz). STAR can use compressed files.
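Because the files stay compressed, it can be useful to confirm that a download is intact and to count its reads without writing a decompressed copy to disk. The sketch below assumes the standard four-lines-per-record FASTQ layout; the function names `gz_ok` and `count_reads` are hypothetical.

```shell
# Return success only if the gzip stream is complete
# (a truncated download fails this test).
gz_ok() {
    gzip -t "$1" 2>/dev/null
}

# Count reads in a gzipped FASTQ file by streaming it through awk.
# Assumes each record occupies exactly four lines.
count_reads() {
    gunzip -c "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Example calls (the file name is taken from the download example):
# gz_ok SRR7506545_1.fastq.gz && count_reads SRR7506545_1.fastq.gz
```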
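Perform a quality check of the downloaded reads using FastQC (Andrews, 2010), then inspect the output HTML report for each read file to determine whether any pre-processing is needed. Write a shell file (.sh) with the contents shown below, using the single-end or paired-end version to match your data.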
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load fastqc/0.11.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
INPUT=/path/to/raw/seq/reads/folder
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/qc/output/folder
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
# CREATE THE OUTPUT FOLDER IF NEEDED (FASTQC DOES NOT CREATE IT)
mkdir -p $OUTPUT
fastqc -o $OUTPUT $INPUT/$FILE.$READFORMAT.gz
done
#######################################
Execute this shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
For paired-end reads, use the following version, which takes the END1 and END2 file suffixes:
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=50g,vmem=50g
#PBS -m e
#######################################
# DO NOT EDIT
module load fastqc/0.11.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
INPUT=/path/to/raw/seq/reads/folder
END1= # paired-end 1 file suffix; e.g., _1, _r1, _R1, .1, .r1, .R1, etc.
END2= # paired-end 2 file suffix; e.g., _2, _r2, _R2, .2, .r2, .R2, etc.
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/qc/output/folder
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
# CREATE THE OUTPUT FOLDER IF NEEDED (FASTQC DOES NOT CREATE IT)
mkdir -p $OUTPUT
fastqc -o $OUTPUT $INPUT/$FILE$END1.$READFORMAT.gz $INPUT/$FILE$END2.$READFORMAT.gz
done
#######################################
Execute this shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
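Besides the HTML report, FastQC writes a `<name>_fastqc.zip` archive containing a tab-separated `summary.txt` (status, module, file name). When many files are checked, a filter along the following lines can list only the modules flagged WARN or FAIL. This is a sketch; the function name `fastqc_flags` is hypothetical.

```shell
# Print the FastQC modules flagged as WARN or FAIL.
# Reads one or more summary.txt streams on standard input
# (tab-separated columns: status, module, file name).
fastqc_flags() {
    awk -F '\t' '$1 == "WARN" || $1 == "FAIL" { print $3 "\t" $1 "\t" $2 }'
}

# Example call (the folder is a placeholder):
# for z in /path/to/desired/qc/output/folder/*_fastqc.zip; do
#     unzip -p "$z" '*/summary.txt'
# done | fastqc_flags
```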
For the reference genome, STAR accepts FASTA sequence files as genome input and GTF (or GFF3) annotation files as genome annotation input.
| FASTA Genome Examples | GTF Annotation Examples |
|---|---|
| GRCh38.p13.genome.fa | gencode.v34.annotation.gtf |
| | lncipedia_5_2_hg38.gtf |
| GCF_015227675.2_mRatBN7.2_genomic.fna | GCF_015227675.2_mRatBN7.2_genomic.gtf |
It is imperative that the FASTA chromosome names match the GTF chromosome names. To avoid issues regarding chromosome naming differences, try to use files from the same source together (e.g., ENSEMBL FASTA files with ENSEMBL GTF files, etc.).
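One way to check for naming mismatches before indexing is to compare the sequence names in the two files. The sketch below lists names that appear in the GTF but not in the FASTA; it takes the FASTA name to be the text after `>` up to the first space. The function name `gtf_only_chroms` is hypothetical.

```shell
# List sequence names that appear in the GTF but not in the FASTA.
# An empty result means every annotated chromosome exists in the genome.
gtf_only_chroms() {
    local fasta=$1 gtf=$2 tmpdir
    tmpdir=$(mktemp -d)
    # FASTA names: header text after ">" up to the first whitespace
    grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' | LC_ALL=C sort -u > "$tmpdir/fasta.names"
    # GTF names: column 1 of every non-comment line
    grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$tmpdir/gtf.names"
    # Keep only the names unique to the GTF list
    LC_ALL=C comm -13 "$tmpdir/fasta.names" "$tmpdir/gtf.names"
    rm -r "$tmpdir"
}

# Example call (file names are taken from the table above):
# gtf_only_chroms GRCh38.p13.genome.fa gencode.v34.annotation.gtf
```

A typical mismatch is one file using `chr1` while the other uses `1`, in which case every GTF name is printed.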
Full reference genome assemblies from the Genome Reference Consortium (GRC) are best downloaded through NCBI’s RefSeq (accession form: GCF_000001405) or GenBank (accession form: GCA_000001405). Both versions are available through GENCODE for many organism assemblies.
Genesets for Homo sapiens can be easily downloaded from a variety of reputable organizations and institutions, including:
Although these files often come compressed (.gz), they must be decompressed for use by tools in this pipeline. To do this, use gunzip.
gunzip GRCh38.p13.genome.fa.gz
gunzip gencode.v38.annotation.gtf.gz
This step need only be performed ONCE per genome assembly; skip it if a STAR genome index for the same version of the genome assembly has already been generated. Indexing lets STAR look up reference genome information far more efficiently, so mapping runs faster.
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=50g,vmem=50g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
GENOMEINDEX=/path/to/star/genome/indices/folder
GENOMESEQ=/path/to/genome/fasta/file.fa
#######################################
#######################################
# DO NOT EDIT
# CREATE THE INDEX FOLDER IF IT DOES NOT ALREADY EXIST
mkdir -p $GENOMEINDEX
STAR --runThreadN 32 --runMode genomeGenerate --genomeDir $GENOMEINDEX --genomeFastaFiles $GENOMESEQ
samtools faidx $GENOMESEQ
#######################################
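Since the index only needs to be generated once per assembly, a guard can be used to skip the step when the folder already holds an index. The sketch below treats the presence of a few non-empty index files (`Genome`, `SA`, `SAindex`, `genomeParameters.txt`) as a heuristic for a finished index; that heuristic and the function name `star_index_exists` are assumptions, not part of STAR itself.

```shell
# Return success if the folder already looks like a finished STAR genome index.
# Assumption: these non-empty files are taken as evidence of a complete index.
star_index_exists() {
    local f
    for f in Genome SA SAindex genomeParameters.txt; do
        [ -s "$1/$f" ] || return 1
    done
}

# Example use (GENOMEINDEX as defined in the script above):
# if star_index_exists "$GENOMEINDEX"; then
#     echo "STAR index found in $GENOMEINDEX; skipping genomeGenerate"
# fi
```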
First, write an R file (.r) that will be used to calculate the normalized expression values for each gene – as annotated by whichever genome annotation you supply – in reads per kilobase of transcript, per million mapped reads (RPKM). In this way, we effectively normalize our samples for sequencing depth and gene length.
Note the path where this R file is saved, as it will need to be referenced in the following shell script. The contents of this file are shown below. Edit the paths and file names where indicated.
#######################################
# EDIT VARIABLES TO MATCH THOSE BY THE SAME NAME IN THE SHELL SCRIPT BUT IN QUOTES HERE
OUTPUT <- "/path/to/desired/output/folder"
ANNOTYPE <- "" # e.g., GENCODE, LNCipedia, etc.
ATTRIBUTE <- "" # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, etc.
PREFIX <- "" # the shared prefix of all raw data being processed in this run; e.g., SRR, PM_C28_WT, etc.
#######################################
#######################################
# DO NOT EDIT
setwd(OUTPUT)
# FIND SAMPLE FOLDERS WHOSE NAMES START WITH THE SHARED PREFIX
dirs <- dir(OUTPUT, pattern=paste0("^", PREFIX))
fileName <- file.path(dirs, paste0(dirs, paste0(".",ANNOTYPE,".",ATTRIBUTE,".base.counts.txt")))
newFileName <- file.path(dirs, paste0(dirs, paste0(".",ANNOTYPE,".table_rpkm.txt")))
for (i in seq(along=fileName)) {
d <- read.table(fileName[i],sep="\t",header=T)
row.names(d) <- d$Geneid
# STORE GENE LENGTHS COLUMN OF COUNTS TABLE
l <- d[,6]
# SUM COUNTS COLUMNS TO GET TOTAL MAPPED READS PER SAMPLE
cS <- colSums(d[, 7:ncol(d), drop=F])
# STORE A MATRIX OF THE READ COUNTS PER GENE
d <- d[,-c(1:6)]
# CALCULATE RPKM FOR EACH GENE: 10^9 * COUNTS / (GENE LENGTH IN BASE PAIRS * TOTAL COUNTS)
rpkm <- (10^9)*t(t(d/l)/cS)
# RESTORE COLUMN NAME
colnames(rpkm)[1] <- names(cS)
write.table(rpkm,file=newFileName[i],sep="\t",quote=F,row.names=F)
}
#######################################
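To make the arithmetic in the R script concrete: for a gene with read count C and length L (in base pairs), in a sample with N counted reads in total, RPKM = 10^9 × C / (L × N). The following sketch reproduces that calculation with awk on a simplified three-column table (gene id, length, count); the table layout and the function name `rpkm_table` are illustrative assumptions and differ from the featureCounts output format.

```shell
# Compute RPKM from a tab-separated table of: gene id, gene length (bp), read count.
# The file is read twice: once to total the counts, once to print the RPKM values.
rpkm_table() {
    awk -F '\t' '
        NR == FNR { total += $3; next }                       # first pass: total counted reads
        { printf "%s\t%.2f\n", $1, 1e9 * $3 / ($2 * total) }  # second pass: RPKM per gene
    ' "$1" "$1"
}

# Worked example: with 1000 counted reads in total, a 2000 bp gene with
# 100 reads has RPKM = 10^9 * 100 / (2000 * 1000) = 50000.
```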
Next, write a shell file (.sh) with the contents shown below, choosing the single-end or paired-end version to match your data. Edit the paths and file names where indicated. Generally, FEATURE=exon and ATTRIBUTE=gene_id are used; only change these two variables if you know what you are doing.
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
module load rna-seqc/2.0.0
module load subread/2.0.0
module load R/3.6.1
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
CELLTYPE= # e.g., RPE1_Control, THP1_PMAtreated, etc.
INPUT=/path/to/raw/seq/reads/folder
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/output/folder
GENOMEINDEX=/path/to/star/genome/indices/folder
ANNOTATION=/path/to/genome/annotation/file.gtf
ANNOTYPE= # e.g., GENCODE, LNCipedia, etc.
FEATURE= # check column 3 of annotation file for desired feature to count reads to; e.g., exon, transcript, gene, CDS, UTR, etc.
ATTRIBUTE= # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, etc.
RPKMscript=/path/to/Rscript/file.r
SAMPLE=sampleA # must match the first sample in your dataset
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
# CREATE THE OUTPUT FOLDERS (NO ERROR IF THEY ALREADY EXIST)
mkdir -p $OUTPUT/$FILE
# MAP READS WITH STAR
STAR --runThreadN 32 \
--genomeDir $GENOMEINDEX \
--sjdbGTFfile $ANNOTATION \
--readFilesIn $INPUT/$FILE.$READFORMAT.gz \
--outFileNamePrefix $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_ \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand gunzip -c
# INDEX MAPPED READ FILES WITH SAMTOOLS
samtools index $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# QUALITY CONTROL OF MAPPED READ FILES WITH RNASEQC
rna-seqc $ANNOTATION $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam $OUTPUT/$FILE/$FILE.$ANNOTYPE.rnaseqc_report --coverage --rpkm --sample $FILE.$ANNOTYPE.rnaseqc
# DETERMINE THE NUMBER OF READS MAPPED TO EACH GENE OF THE GENOME WITH FEATURECOUNTS
featureCounts -T 32 --ignoreDup -t $FEATURE -g $ATTRIBUTE -a $ANNOTATION -o $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# REMOVE HEADER LINE FROM COUNTS FILE
echo "$(tail -n +2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt)" > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# NAME THE COUNTS COLUMN
sed -i "1s/\/hpf.*bam/$FILE.Counts/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# ISOLATE THE COUNTS COLUMN
cut -f7 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.countsonly.txt
# COPY GENEID DATA TO A SEPARATE FILE
awk '{print $1}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$ANNOTYPE.gene_ids.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENEID DATA
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.consensus.txt
# CALCULATE THE NORMALIZED EXPRESSION VALUES FOR EACH GENE
Rscript $RPKMscript
# NAME THE RPKM COLUMN
sed -i "1s/X.*bam/$FILE.RPKM/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt
# ISOLATE THE RPKM COLUMN
cut -f2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.rpkmonly.txt
# COPY GENE POSITION INFORMATION TO A SEPARATE FILE
awk 'BEGIN {OFS="\t"}; {print $2,$3,$4,$5}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# REMOVE HEADER LINE FROM GENOMIC POSITION FILE
echo "$(tail -n +2 $OUTPUT/featureCoordTemp.$ANNOTYPE.txt)" > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# PARSE AND SIMPLIFY GENOMIC POSITION INFORMATION
awk 'BEGIN {OFS="\t"}; {
split($1,array,";")
num1=split($2,array1,";")
num2=split($3,array2,";")
min=array1[1]
for(i=2;i<=num1;i++){
min=(min<array1[i]?min:array1[i])
}
max=array2[1]
for(i=2;i<=num2;i++){
max=(max>array2[i]?max:array2[i])
}
split($4,array3,";")
print array[1],min,max,array3[1]
}' $OUTPUT/featureCoordTemp.$ANNOTYPE.txt > $OUTPUT/featureCoord.$ANNOTYPE.txt
# REMOVE TEMPORARY GENOMIC POSITION FILE
rm $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# ADD ALL RPKM COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.rpkmonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.RPKM.genPos.consensus.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.genPos.consensus.txt
done
#######################################
Finally, execute the newly written shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
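Once the job has finished, STAR leaves a `Log.final.out` file next to each BAM file (named with the same `--outFileNamePrefix`), which includes a `Uniquely mapped reads %` line. A quick look at that value for every sample can reveal a failed or poor mapping before the counts are used. The sketch below extracts it; the function name `uniquely_mapped_pct` is hypothetical.

```shell
# Print the "Uniquely mapped reads %" value from a STAR Log.final.out file.
# Lines in that file have the form "<label> |<tab><value>".
uniquely_mapped_pct() {
    awk -F '|' '/Uniquely mapped reads %/ { gsub(/[[:space:]]/, "", $2); print $2 }' "$1"
}

# Example call (variables as defined in the mapping script):
# uniquely_mapped_pct "$OUTPUT/$FILE/${FILE}_${ANNOTYPE}annotation_Log.final.out"
```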
For paired-end reads, use the following version, which takes the END1 and END2 file suffixes:
#!/bin/bash
#PBS -l walltime=23:59:00
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g,vmem=100g
#PBS -m e
#######################################
# DO NOT EDIT
module load star/2.7.0f
module load samtools/1.9
module load rna-seqc/2.0.0
module load subread/2.0.0
module load R/3.6.1
#######################################
#######################################
# EDIT PATHS AND FILE NAMES AFTER: "="
CELLTYPE= # e.g., RPE1_Control, THP1_PMAtreated, etc.
INPUT=/path/to/raw/seq/reads/folder
END1= # paired-end 1 file suffix; e.g., _1, _r1, _R1, .1, .r1, .R1, etc.
END2= # paired-end 2 file suffix; e.g., _2, _r2, _R2, .2, .r2, .R2, etc.
READFORMAT= # e.g., fastq, fq, fasta, fa, etc.
OUTPUT=/path/to/desired/output/folder
GENOMEINDEX=/path/to/star/genome/indices/folder
ANNOTATION=/path/to/genome/annotation/file.gtf
ANNOTYPE= # e.g., GENCODE, LNCipedia, etc.
FEATURE= # check column 3 of annotation file for desired feature to count reads to; e.g., exon, transcript, gene, CDS, UTR, etc.
ATTRIBUTE= # check column 9 of annotation file for desired gene identifier; e.g., gene_id, gene_name, gene, transcript_id, product, exon_id, etc.
RPKMscript=/path/to/Rscript/file.r
SAMPLE=sampleA # must match the first sample in your dataset
# LIST ALL SAMPLE NAMES SEPARATED BY SPACES AFTER: "for FILE in "
for FILE in sampleA sampleB sampleC
#######################################
#######################################
# DO NOT EDIT
do
# CREATE THE OUTPUT FOLDERS (NO ERROR IF THEY ALREADY EXIST)
mkdir -p $OUTPUT/$FILE
# MAP READS WITH STAR
STAR --runThreadN 32 \
--genomeDir $GENOMEINDEX \
--sjdbGTFfile $ANNOTATION \
--readFilesIn $INPUT/$FILE$END1.$READFORMAT.gz $INPUT/$FILE$END2.$READFORMAT.gz \
--outFileNamePrefix $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_ \
--quantMode GeneCounts \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand gunzip -c
# INDEX MAPPED READ FILES WITH SAMTOOLS
samtools index $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# QUALITY CONTROL OF MAPPED READ FILES WITH RNASEQC
rna-seqc $ANNOTATION $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam $OUTPUT/$FILE/$FILE.$ANNOTYPE.rnaseqc_report --coverage --rpkm --sample $FILE.$ANNOTYPE.rnaseqc
# DETERMINE THE NUMBER OF READS MAPPED TO EACH GENE OF THE GENOME WITH FEATURECOUNTS
featureCounts -T 32 --ignoreDup -p -t $FEATURE -g $ATTRIBUTE -a $ANNOTATION -o $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt $OUTPUT/$FILE/$FILE\_$ANNOTYPE\annotation_Aligned.sortedByCoord.out.bam
# REMOVE HEADER LINE FROM COUNTS FILE
echo "$(tail -n +2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt)" > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# NAME THE COUNTS COLUMN
sed -i "1s/\/hpf.*bam/$FILE.Counts/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt
# ISOLATE THE COUNTS COLUMN
cut -f7 $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.$ATTRIBUTE.countsonly.txt
# COPY GENEID DATA TO A SEPARATE FILE
awk '{print $1}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.counts.txt > $OUTPUT/$ANNOTYPE.gene_ids.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENEID DATA
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.consensus.txt
# CALCULATE THE NORMALIZED EXPRESSION VALUES FOR EACH GENE
Rscript $RPKMscript
# NAME THE RPKM COLUMN
sed -i "1s/X.*bam/$FILE.RPKM/" $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt
# ISOLATE THE RPKM COLUMN
cut -f2 $OUTPUT/$FILE/$FILE.$ANNOTYPE.table_rpkm.txt > $OUTPUT/$FILE/$FILE.$ANNOTYPE.rpkmonly.txt
# COPY GENE POSITION INFORMATION TO A SEPARATE FILE
awk 'BEGIN {OFS="\t"}; {print $2,$3,$4,$5}' $OUTPUT/$SAMPLE/$SAMPLE.$ANNOTYPE.$ATTRIBUTE.base.counts.txt > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# REMOVE HEADER LINE FROM GENOMIC POSITION FILE
echo "$(tail -n +2 $OUTPUT/featureCoordTemp.$ANNOTYPE.txt)" > $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# PARSE AND SIMPLIFY GENOMIC POSITION INFORMATION
awk 'BEGIN {OFS="\t"}; {
split($1,array,";")
num1=split($2,array1,";")
num2=split($3,array2,";")
min=array1[1]
for(i=2;i<=num1;i++){
min=(min<array1[i]?min:array1[i])
}
max=array2[1]
for(i=2;i<=num2;i++){
max=(max>array2[i]?max:array2[i])
}
split($4,array3,";")
print array[1],min,max,array3[1]
}' $OUTPUT/featureCoordTemp.$ANNOTYPE.txt > $OUTPUT/featureCoord.$ANNOTYPE.txt
# REMOVE TEMPORARY GENOMIC POSITION FILE
rm $OUTPUT/featureCoordTemp.$ANNOTYPE.txt
# ADD ALL RPKM COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.rpkmonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.RPKM.genPos.consensus.txt
# ADD ALL COUNTS COLUMNS FOR EACH SAMPLE TO THE GENOMIC POSITION INFORMATION
paste $OUTPUT/$ANNOTYPE.gene_ids.txt $OUTPUT/featureCoord.$ANNOTYPE.txt $OUTPUT/*/*.$ANNOTYPE.$ATTRIBUTE.countsonly.txt > $OUTPUT/$CELLTYPE.$ANNOTYPE.counts.genPos.consensus.txt
done
#######################################
Finally, execute the newly written shell file by submitting it as a job to the scheduler. To do this, use qsub.
qsub <fileName>.sh
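The FEATURE and ATTRIBUTE variables in the scripts above must name values that actually occur in columns 3 and 9 of the annotation file. A quick way to see what a given GTF file offers is sketched below; the function names are hypothetical, and the attribute parsing assumes that no attribute value contains a semicolon.

```shell
# Count the feature types found in column 3 of a GTF file.
gtf_features() {
    grep -v '^#' "$1" | cut -f3 | LC_ALL=C sort | uniq -c
}

# List the attribute keys found in column 9 of a GTF file.
# Assumes attributes are separated by ";" and that no value contains ";".
gtf_attribute_keys() {
    grep -v '^#' "$1" | cut -f9 | tr ';' '\n' | awk 'NF { print $1 }' | LC_ALL=C sort -u
}

# Example calls (the path is a placeholder):
# gtf_features /path/to/genome/annotation/file.gtf
# gtf_attribute_keys /path/to/genome/annotation/file.gtf
```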
At the top of most scripts in this protocol, where you set the variables to suit your data, make sure you adhere to these strict rules:
For shell script variables, type immediately after the “=” sign. Do NOT add a space between the “=” sign and your text, and do NOT put your text in single or double quotes. This applies to strings and file/folder paths.
For R script variables, type between the double quotes that follow the “<-” sign. Do NOT add a space between the opening or closing quotation mark and your text. This applies to strings and file/folder paths. Double-check that you have not accidentally added extra single or double quotes inside the pair already in the template.
In the R script, you must specify the prefix shared by all samples you wish to analyze in this job. The more specific (i.e., longer) the shared prefix, the better. This is particularly important if you are analyzing a new subset of samples and storing their results in an output folder that already holds earlier results. You do not want RPKMs to be (re-)calculated for every sample in that folder; rather, only for the samples in this latest job, whose shared prefix should be distinct from that of previously analyzed samples in the same destination folder.
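These rules can be checked before a job is submitted. The sketch below is one possible approach, with hypothetical function names: `lint_assignments` prints the lines of a shell file whose variable assignments have a space around “=” or a quoted value, and `preview_prefix` lists the sub-folders of an output folder whose names start with a given prefix, so you can see which samples the R script would pick up.

```shell
# Print (with line numbers) assignments that break the rules:
# a space before "=", a space between "=" and the value, or a quoted value.
# Untouched template lines such as "READFORMAT= # e.g., ..." are not reported.
lint_assignments() {
    grep -nE '^[A-Za-z_][A-Za-z0-9_]*([[:space:]]+=|=[[:space:]]+[^#[:space:]]|=["'\''])' "$1"
}

# List the sub-folders of an output folder whose names start with a prefix.
preview_prefix() {
    local d
    for d in "$1"/"$2"*/; do
        [ -d "$d" ] && basename "$d"
    done
    return 0
}

# Example calls (paths and prefix are placeholders):
# lint_assignments myJob.sh
# preview_prefix /path/to/desired/output/folder SRR
```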
Andrews, S. (2010). FASTQC. A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. Accessed 16 September 2021.
Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635